ENGLISH 



A Fast, Vectorizable Algorithm for Producing 
Single-Precision Sine-Cosine Pairs 

Marcus H. Mendenhall 
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Abstract — This paper presents an algorithm for com- 
puting Sine-Cosine pairs to modest accuracy, but in a 
manner which contains no conditional tests or branching, 
making it highly amenable to vectorization. An exem- 
plary implementation for PowerPC AltiVec processors is 
included, but the algorithm should be easily portable to 
other achitectures, such as Intel SSE. 

Index Terms — Elementary function approximation. Nu- 
merical algorithms. Parallel Algorithms, Mathematical 
Software / Parallel and vector implementations 



I. Introduction 

THE need for sine-cosine pairs arises often in 
numerical computing and digital signal pro- 
cessing. For many of the applications, the functions 
are needed at uniformly spaced values of the argu- 
ment, and can be generated by efficient and accurate 
recursion relations. 

For some applications, though, one needs to com- 
pute trigonometric functions in random order, or for 
non-uniformly spaced values. If a large number of 
such evaluations are needed, it would be convenient 
to be able to take advantage of the extremely fast 
vector processing units included on many mod- 
ern desktop computer families. Unfortunately, most 
common algorithms require that one reduce the 
angle modulo 7r/2 onto the range [— vr/4, 7r/4) (or 
sometimes mod n onto [—n/2,n/2)) [1], then to 
compute the function using some reasonable power 
series, and then to fill in the rest of the circle 
by appropriate exchanges and sign changes. The 
shuffling at the end to fill in all four quadrants is 
not easy to do if multiple angles are being computed 
at once in a vector processor, since each quadrant 
requires a different shuffle. 



II. Methodology 

The approach taken in this work is slightly dif- 
ferent than the approach used historically. The goal 
of it is to avoid the relatively expensive testing and 
shuffling resulting from the reduction of the angle 
to one quadrant. To accomplish this, the angle is 
first scaled by l/27r and reduced modulo 1 onto 
[—1/2, 1/2) by rounding and subtraction. Then, a 
power series of the functions of an angle scaled 
down by a power of two is computed, and the 
resulting trig pair is shifted back onto the whole 
circle using double-angle formula recursion. 

Computing functions in this manner has a number 
of useful properties. As discussed, it is highly vec- 
torizable. It also produces a very smooth result; that 
is, except for the error when the modular conversion 
wraps around at (2n + l)n, there are no seams 
at which one must eliminate discontinuites. Thus, 
algorithms which depend on differentiation (e.g. 
nonlinear least- squares fitting) will not be likely to 
be disrupted by the joining. In reality, most lEEE- 
754 compliant algorithms put a great deal of effort 
into avoiding such discontinuities by assuring the 
error in the computation is less than 0.5 LSB. This 
effort, though costs time. The method described 
here, instead, sacrifices a little absolute accuracy 
in trade for extremely high speed, while preserving 
smoothness. 

In the implementation shown in this paper, I 
compute sin(27rx/4) and cos(27rx/4) where x is the 
scaled and reduced angle as discussed above, and 
then use 2 angle-doublings to expand this over the 
entire circle using: 



sin(2^) 
cos(2^) 



2 sin 6 cos 6 
cos^ 9 — sin^ 9 



(1) 



The issue that needs to be addressed when carry- 
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Since this algorithm is designed to generate single- 
precision functions, and to operate on a single- 
precision vector processing unit, one must pay close 
attention to instabilities in this recursion due to 
roundoff error. In fact, for this algorithm, there were 
two candidates for the angle doubling. The first is 
the one above, and the second is 



sin(2e) 
cos(2^) 



2 sin 9 cos 9 
1 - 2 sin^ 9 



(2) 



Each of the two methods has an instability, but 
in the chosen method it is easily fixed at the end, 
and in the rejected method it is not so. 

Consider the case of a small error introduced: 
define 

si = a sin(^ + 5), ci = a cos(6' + 5) 

and iterate once using ([T]) to get: 

S2 = 2a^cos{9 + 5)sin{9 + 5)=a^sm2{9 + 5) 
C2 = a^cos^(6' + (5) -a^sin^(6' + (5) 
= a^cos2{9 + 5) 

Note that, then, the magnitude has drifted, but the 
error in the angle is stable. 

Now, consider the case using ©: 



S2 = a^ sin 2(6' + 5) 

C2 = l-2a^siYi^{9 + 5) 

= a" (cos2(0 + ,5) - sin2(^ + S)) + (1 - a") 

= a^ cos2{9 + 5) + {1 - a"^) 

thus mixing the angle and amplitude error inextri- 
cably. 

Thus, the form of ([T]) produces a pair (after n 
iterations) of a2"cos2"(^ + 6), a2"sin2"(e + 6) 
which has the same relative error in the angle as 
the initial estimate, but a scale factor. Since the 
scale factor has no effect on the accuracy of the 
properly-normalized trig pair, there are two options 
for computing the raw pair. These options differ in 
the fitting algorithm used to find the coefficients for 
the series. 

The first method tries to produce a pair which 
has a very small normalization error by directly 
computing least-squares series for sine and cosine. 
Assuming the original amplitude error is small, 
so a = 1 + e, then a^" ^ 1 + 2ne. Thus, the 



final amplitude has a relative error 2n times big- 
ger than the initial error. For n = 2 and single- 
precision arithmetic, this implies an amplitude er- 
ror of about 4 X 10^^ (assuming 1 LSB error at 
the start). To correct for this at the end, compute 
■^n + c^ = 1 + a and divide the function pair by 
y/1 + a. In practice, since a is small, multiplying 
the pair by 1 — a/2 = (3 — s^ — c^)/2 provides 
a correctly normalized result. Using this method, 
the RMS error is 1.2 x lO^^and the maximum 
e rror is 4.8 x 10~^. These errors are meas ured as 
a/(cos 9 — ref_ cos 9)"^ + (sin 9 — ref_ sin 9)'^ where 
the reference functions are computed to double 
precision using the standard system library calls. 
The maximum amplitude error, computed as 1 — 
Vcos^ 9 + sin^ 9 is 1.8 x 10^''. For this method, the 
polynomials: 

sin^/4 = x( 1.5707963235 
-0.645963615x2 
+0.0796819754 x^ 
-0.0046075748 x*^) 

cos 0/4 = 1 

-1.2336977925x2 
+0.2536086171 x^ 
-0.0204391631 x^ 

are used, where x is the range-reduced value of 
9/271 discussed at the beginning of this paper. 

The second method relaxes the requirement that 
the magnitude error be small, by fitting the input 
angle to arctan(sin0/cos6^), but requires then an 
accurate division by the actual magnitude of the 
pair at the end. Doing this produces a slightly better 
errror budget, with an RMS error of 9.8 x 10~^ 
and a maximum error of 3.8 x 10~^. The penalty 
in the final algorithm is doing a precise reicprocal 
and multiply, which costs about 5% in speed. For 
this method, the polynomials: 

sin^ 9/4: = x( 1.5707963268 
-0.6466386936x2 
+0.0679105987 x^ 
-0.0011573807x'^) 

cos^9/4: = 1 

-1.2341299769x2 
+0.2465220241 x^ 
-0.0123926179 x^ 
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are used. Note that calling these sine and co- 
sine is somewhat dangerous; they are really the 
numerator and denominator of a good rational- 
function representation for tan 6/ A but do not satisfy 
sin^ + cos^ = 1.1 have annotated the terms in the 
equations with the (f) symbol to remind the reader 
of this. 



[2] "AltiVec Technology Programming Interface Manual", Mo- 
torola SPS document ID ALTIVECPIM/D, 1999 



III. Results 

In figure [H I show a sample implementation of 
this, coded in C (compatible with gcc3.xx or IBM 
xlc). This code computes a trig pair vector (4 trig 
pairs) in 140 ns on a 500 MHz PowerPC 7400 (aka 
PowerMacintosh G4). 

It is important to note that, even for this al- 
gorithm, the throughput can be improved quite a 
bit. Because of the chain-computation nature of 
this, there are many stalls waiting for latency in 
the vector unit. If one computes two vectors at a 
time, by just interleaving this code with itself with 
different variable names, throughput rises another 
50%. However, the logistics of using this in other 
code often does not make it worthwhile. Also, 
the nature of optimizations of this type is quite 
architecture-dependent and therefore not within the 
intended scope of this work. However, with this 
optimization, one can generate a single-precision 
trig function every 7 clock cycles of the CPU (8 
pairs in 100 ns), including loading and storing the 
results. For a discussion of the instruction set and 
throughput of this particular vector unit, see [2]. 
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IV. Conclusions 

For applications requiring very high speed, 
random-access calculations of sine-cosine pairs, it 
is possible to forego a very small amount of accu- 
racy in exchange for a great deal of speed. Using 
multiple-angle formulas to expand a sine-cosine pair 
computed on a small piece of the reduced range is 
an effective approach to this. The author has been 
applying this method to the rapid computation of 
optical diffraction patterns via Huygen's principle, 
but it should be widely applicable to other problems 
for which very large numbers of single-precision 
trigonometric pairs are needed. 
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/* define FASTER_SINCOS for the slightly-less-accurate results in slightly less time */ 
#define FASTER_SINCOS 

#if ! defined (FASTER_SINCOS) /* these coefficients generate a badly un-normalized sine-cosine pair, but the angle 

#define ssl 1.5707963268 

#define ss2 -0.6466386396 

#define ss3 0.0679105987 

#define ss4 -0.0011573807 

#define ccl -1.2341299769 

#define cc2 0.2465220241 

#define cc3 -0.0123926179 

#else /* use 20031003 coefficients for fast, normalized series*/ 

#define ssl 1.5707963235 

#define ss2 -0.645963615 

#define ss3 0.0796819754 

#define ss4 -0.004 6075748 

#define ccl -1.2336977925 

#define cc2 0.2536086171 

#define cc3 -0.0204391631 

#endif 

inline void FastSinCos (vector float v, struct phase *ph) 

{ 

vector float si, s2, cl, c2, fixmagl; 

vector float xl-vec_madd (v, (vector f loat) (1 . 0/ (2 . 0*3 . 141592 6536) ) , 
(vector float) (0.0)) ; 

/* ql^x/2pi reduced onto (-0.5,0.5), q2^ql**2 */ 

vector float ql^vec_nmsub (vec_round (xl ) , (vector float) (1.0), xl) ; 

vector float q2^vec_madd (ql , ql, (vector float) (0.0)); 

sl^ vec_madd(ql, 

vec_madd (q2 , 

vec_madd (q2 , 

vec_madd(q2, (vector float) (ss4), 
(vector float) (ss3) ) , 
(vector float) ( ss2) ) , 
(vector float) (ssl) ) , 
(vector float) (0.0) ) ; 
cl^ vec_madd(q2, 

vec_madd (q2 , 

vec_madd(q2, (vector float) (cc3) , 
(vector float) (cc2) ) , 
(vector float) (ccl) ) , 
(vector float) (1.0) ) ; 
/* now, do one out of two angle -doublings to get sin & cos theta/2 */ 
c2^vec_nmsub (si , si, vec_madd(cl, cl, (vector float) (0.0))); 
s2^vec_madd ( (vector f loat) (2 . 0) , vec_madd (si, cl, (vector float) (0.0)), 

(vector float) (0.0) ) ; 
/* now, cheat on the correction for magnitude drift... 
if the pair has drifted to (l+e)*(cos, sin), 
the next iteration will be (1 + e) **2* (cos, sin) 
which is, for small e, (l + 2e) * (cos, sin) . 
However, on the (1+e) error iteration, 
sin**2 + cos**2^ (1 + e) **2^1 + 2e also, 
so the error in the square of this term 

will be exactly the error in the magnitude of the next term. 
Then, multiply final result by (1-e) to correct */ 

#if defined (FASTER_SINCOS) /* this works with properly normalized sine-cosine functions, but un-normalized is mor 

f ixmagl^vec_nmsub (s2 , s2 , vec_nmsub (c2 , c2, (vector float) (2.0))); 
#else /* must use this method with un-normalized series, since magnitude error is large */ 

f ixmagl^Reciprocal (vec_madd (s2 , s2 , vec_madd (c2 , c2 , (vector float) (0.0)))); 
#endif 

cl^vec_nmsub (s2 , s2, vec_madd(c2, c2, (vector float) (0.0))); 

sl=vec_madd ( (vector float) (2.0), vec_madd (s2 , c2 , (vector float) (0.0)), 
(vector float) (0.0)) ; 

ph->c^vec_madd (cl , fixmagl, (vector float) (0.0)); 
ph->s-vec_madd (si , fixmagl, (vector float) (0.0)); 



Fig. 1. Sample Implementation for PowerPC AltiVec vector processor 



